The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# importing the sklearn libraries
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering,KMeans,DBSCAN
from sklearn.compose import ColumnTransformer
from category_encoders import TargetEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# from scipy importing libraries
from scipy.cluster.hierarchy import dendrogram,linkage,cophenet
from yellowbrick.cluster import SilhouetteVisualizer
from scipy.spatial.distance import cdist,pdist
from sklearn.metrics import silhouette_score
## WARNING
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv("stock_data.csv")
df.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
data=df.copy()
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
Observation
data.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net Cash Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327937 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings Per Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
Obseravtion
Ticker Symbol is the unique code of Security Security is also an unique containtGICS Sector is having 11 unique values GICS Sub Industr is have 104 unique values Current Price is slight skewed towards right Price change has the negitive values and the max value is 55.05 which has outliers in the data Volatility is slightly skwed toward left ROE is skewed toward left and has outliers Cash Ratio is skewed toward left and has outliers Net Cash Flow,Earnings Per Share,P/B Ratio,P/E Ratio data very hight velocity changes data.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
def histogram_boxplot(data,feature,kde=False,figsize=(15,10)):
f,(ax_1,ax_2)=plt.subplots(2,1,figsize=figsize,sharex=True,gridspec_kw={'height_ratios':(0.25,0.15)})
sns.boxplot(data,x=feature,ax=ax_1,showmeans=True,color="red")
sns.histplot(data,x=feature,ax=ax_2,color="blue",kde=kde)
ax_1.axvline(data[feature].median(),color="yellow")
ax_2.axvline(data[feature].median(),color="green")
def labeled_plot(data,feature,perc=True,n=None):
count=data[feature].nunique()
total=len(data[feature])
if n is None:
plt.figure(figsize=(count+2,6))
else:
plt.figure(figsize=(n+2,6))
plt.xticks(rotation=90)
ax=sns.countplot(data=data,x=feature,palette="Paired",order=data[feature].value_counts().index[:n])
for p in ax.patches:
if perc is True:
label="{0:.2f}%".format(100*p.get_height()/total)
else:
label=p.get_height()
x=p.get_x()+p.get_width()/2
y=p.get_height()
plt.annotate(
label,
(x,y),
ha="center",
va="center",
xytext=(0,5),
textcoords="offset points")
1. What does the distribution of stock prices look like?
num_col=data.select_dtypes(exclude=np.object_).columns
for i in num_col:
histogram_boxplot(data,i,kde=True)
num_col
Index(['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'],
dtype='object')
Observation
Current Price is right skewed and the data has the outliers in upper whiskerPrice Change this is 13 weeks price change. There are some stock which has heavy downfall and heavy incremental Volatility There are some stocks with high volatilityROEA measure of financial performance calculated by dividing net income by shareholders' equity has more outliers in upper whiskersCash Ratio,
`Net Cash Flow`, `Net Income`, `Earnings Per Share`,
`Estimated Shares Outstanding`, `P/E Ratio`, `P/B Ratio` have more outliers in the data
labeled_plot(data,"GICS Sector",perc=True)
Obswrvation
labeled_plot(data,"GICS Sub Industry",perc=True,n=20)
2. The stocks of which economic sector have seen the maximum price increase on average?
for i,val in enumerate (num_col):
plt.figure(figsize=(i+2,6))
f,(ax)=plt.subplots(1,1,sharex=True,)
ax=sns.lineplot(data,x="GICS Sector",y=val)
plt.xticks(rotation=90)
<Figure size 200x600 with 0 Axes>
<Figure size 300x600 with 0 Axes>
<Figure size 400x600 with 0 Axes>
<Figure size 500x600 with 0 Axes>
<Figure size 600x600 with 0 Axes>
<Figure size 700x600 with 0 Axes>
<Figure size 800x600 with 0 Axes>
<Figure size 900x600 with 0 Axes>
<Figure size 1000x600 with 0 Axes>
<Figure size 1100x600 with 0 Axes>
<Figure size 1200x600 with 0 Axes>
Observation
Current-price vs GICS Sector Healthcare and consumer Discretionary has Uptrend compared to other Price change vs GICS Sector Energy has downtrend comapring to other sectors Volatility vs GICS Sector The Energy stocks has more volatility Roe vs GICS SectorEnergy and consumer Staples has more financial performance calculatedcash ration vs GICS Sector informatio Technology has more cash flow net cash flow vs GICS Sector the telecommunication has less clash flow remaining has the balanced trend net income vs GICS Sector The telecommunication has highest net income earnings per share vs GICS Sector - the energy stocks less earning per stock estimated share outstanding the telcome has the highest outstanding because of high netincomep/bratio vs p/eratio vs GICS Sector- the trends are slightly similar but infomational has slightly differnt 3. How are the different variables correlated with each other?
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(numeric_only=True),cbar=True,annot=True);
Observation
plt.figure(figsize=(20,15))
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x3a91c39d0>
<Figure size 2000x1500 with 0 Axes>
data.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
Observation
data.duplicated().sum()
0
Observation
# checking for the outliers
plt.figure(figsize=(20,15))
for i,val in enumerate (num_col):
plt.subplot(5,3,i+1)
sns.boxplot(data[val])
plt.xlabel(val)
Observation
# treating the outliers
def Treating_outlier(data,col):
for i in col:
q1=data[i].quantile(0.25)
q3=data[i].quantile(0.75)
iqr=q3-q1
lower_whisker=q1- 1.5*iqr
upper_whisker=q3-1.5*iqr
df[i]=np.clip(df[i],lower_whisker,upper_whisker)
return df
Treating_outlier(data,num_col)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 11.392499 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 11.392499 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 11.392499 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 11.392499 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 11.392499 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 11.392499 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 9.394573e+08 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 11.392499 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 11.392499 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 1.884615e+08 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 11.392499 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 2.578925e+08 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 11.392499 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 4.985294e+08 | 70.470585 | 1.723068 |
340 rows × 15 columns
plt.figure(figsize=(20,15))
for i,val in enumerate (num_col):
plt.subplot(5,3,i+1)
sns.boxplot(df[val])
plt.xlabel([val])
Observation
Target_data = data.copy()
for i in cat_name:
te=TargetEncoder()
te.fit(df[i],df["Earnings Per Share"])
value=te.transform(Target_data[i])
Target_data=pd.concat([Target_data,value],axis=1)
Target_data=Target_data.select_dtypes(exclude=np.object_)
Target_data
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Ticker Symbol | Security | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 3.897330 | 3.897330 | 4.397028 | 3.966685 |
| 1 | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 2.825236 | 2.825236 | 4.330906 | 2.569945 |
| 2 | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 2.797913 | 2.797913 | 4.330906 | 3.032254 |
| 3 | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2.579331 | 2.579331 | 2.375414 | 2.363639 |
| 4 | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 2.455728 | 2.455728 | 2.375414 | 2.823150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 9.394573e+08 | 28.976191 | 6.261775 | 1.811691 | 1.811691 | 2.375414 | 2.373630 |
| 336 | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 | 2.801817 | 2.801817 | 4.317254 | 3.536632 |
| 337 | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 1.884615e+08 | 131.525636 | -23.884449 | 2.516879 | 2.516879 | 4.330906 | 3.032254 |
| 338 | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 2.578925e+08 | 22.749999 | -0.063096 | 2.571525 | 2.571525 | 4.145112 | 2.640218 |
| 339 | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 4.985294e+08 | 70.470585 | 1.723068 | 2.503868 | 2.503868 | 4.330906 | 2.569945 |
340 rows × 15 columns
#standardscaling
Target_data=Target_data.drop(["Ticker Symbol","Security"],axis=1)
Target_data
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 4.397028 | 3.966685 |
| 1 | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 4.330906 | 2.569945 |
| 2 | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 4.330906 | 3.032254 |
| 3 | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2.375414 | 2.363639 |
| 4 | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 2.375414 | 2.823150 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 9.394573e+08 | 28.976191 | 6.261775 | 2.375414 | 2.373630 |
| 336 | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 | 4.317254 | 3.536632 |
| 337 | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 1.884615e+08 | 131.525636 | -23.884449 | 4.330906 | 3.032254 |
| 338 | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 2.578925e+08 | 22.749999 | -0.063096 | 4.145112 | 2.640218 |
| 339 | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 4.985294e+08 | 70.470585 | 1.723068 | 4.330906 | 2.569945 |
340 rows × 13 columns
Scaling = StandardScaler()
Scaling_df=Target_data.copy()
Scaling_data=Scaling.fit_transform(Scaling_df)
Scaling_data.shape
(340, 13)
Scaling_data_df=pd.DataFrame(Scaling_data,columns=Target_data.columns)
Scaling_data_df
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 | 0.617320 | 0.964077 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 | 0.589552 | -0.043671 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 | 0.589552 | 0.289885 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 | -0.231655 | -0.192520 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 | -0.231655 | 0.139016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | -0.486181 | 0.901646 | 0.540121 | -0.255142 | 4.308162 | -0.559673 | -1.487784 | -1.127481 | 0.429111 | -0.082116 | 0.572194 | -0.231655 | -0.185312 |
| 336 | -0.289510 | -1.065766 | -0.079703 | 1.062211 | -0.476513 | 0.053235 | -0.051186 | 0.029391 | -0.167741 | -0.337154 | -0.152012 | 0.583819 | 0.653794 |
| 337 | 0.221913 | 0.439539 | -0.206067 | -0.400362 | 0.332009 | 0.164889 | -0.342467 | -0.303532 | -0.460058 | 2.233634 | -1.589390 | 0.589552 | 0.289885 |
| 338 | -0.547053 | -0.436811 | -0.097813 | -0.369243 | 0.320933 | -0.051022 | -0.301171 | -0.239684 | -0.377852 | -0.222714 | 0.118680 | 0.511528 | 0.007032 |
| 339 | -0.336453 | 1.051046 | 0.142671 | -0.078803 | -0.055639 | 0.111378 | -0.293666 | -0.318734 | -0.092942 | 0.854902 | 0.246754 | 0.589552 | -0.043671 |
340 rows × 13 columns
sns.pairplot(Scaling_data_df)
<seaborn.axisgrid.PairGrid at 0x3bc3aba10>
Steps
for visulization we use the Silhouette_visulization #### what are the adavantage and disadantage of k-mean clustering
Pros:
clustering = np.arange(1,10)
Destortion=[]
for i in clustering:
kmean=KMeans(n_clusters=i)
kmean.fit(Scaling_data_df)
value = kmean.transform(Scaling_data_df)
Destortion.append(sum(np.min(cdist(Scaling_data_df,kmean.cluster_centers_),axis=-1))/Scaling_data_df.shape[0])
Destortion
[2.7845826746661433, 2.5246660015780837, 2.41633837543872, 2.3140064788007155, 2.267679631483186, 2.238257888535335, 2.165947554813278, 2.1137427183044433, 2.0885997172802746]
plt.plot(clustering,Destortion,"-bx")
[<matplotlib.lines.Line2D at 0x3a86d3250>]
Observation
cluster =np.arange(2,10)
silhouette=[]
for i in cluster :
model = KMeans(n_clusters=i,init="k-means++")
pre=model.fit_predict((Scaling_data_df))
silhouette.append(silhouette_score(Scaling_data_df,pre))
plt.plot(cluster,silhouette)
[<matplotlib.lines.Line2D at 0x3ae415a50>]
Observation
visulizer= SilhouetteVisualizer(KMeans(n_clusters=2,random_state=1))
visulizer.fit(Scaling_data_df)
visulizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Observation
visulization = SilhouetteVisualizer(KMeans(n_clusters=3,random_state=1))
visulization.fit(Scaling_data_df)
visulization.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Observation
visuization= SilhouetteVisualizer(KMeans(n_clusters=4,random_state=1))
visuization.fit(Scaling_data_df)
visuization.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Observation
final_model = KMeans(n_clusters=2,random_state=1)
final_model.fit_transform(Scaling_data_df)
pre=final_model.fit_predict(Scaling_data_df)
silhou_score= silhouette_score(Scaling_data_df,pre)
silhou_score
0.5414366069113672
Kmean_clustering= Target_data.copy()
Kmean_clustering["Labels"]= final_model.labels_
kclustering=Kmean_clustering.groupby(["Labels"]).median()
kclustering.style.highlight_max()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | |||||||||||||
| 0 | 61.970001 | 5.422003 | 1.347239 | 15.000000 | 47.000000 | 3390000.000000 | 768996000.000000 | 3.030000 | 300243309.000000 | 19.777777 | -1.269332 | 4.145112 | 2.898097 |
| 1 | 32.560001 | -15.478079 | 2.692546 | 32.000000 | 38.000000 | -75150000.000000 | -2408948000.000000 | -6.070000 | 402141680.400000 | 93.089287 | 1.273530 | -4.303637 | -3.203810 |
kclustering=Kmean_clustering.groupby(["Labels"]).mean()
k_cluster_result=kclustering.style.highlight_max()
k_cluster_result
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | |||||||||||||
| 0 | 84.040275 | 5.581787 | 1.425678 | 33.501577 | 71.501577 | 98893810.725552 | 1942408798.107255 | 3.791814 | 582012730.006404 | 28.859834 | -1.936359 | 3.428773 | 2.949350 |
| 1 | 37.062174 | -16.645247 | 2.908353 | 123.608696 | 49.652174 | -542023782.608696 | -4680557565.217391 | -11.214783 | 508330409.753913 | 84.334966 | 1.287871 | -3.988138 | -1.764493 |
Observation
perplexity = [2, 5, 10, 20, 30, 50, 100, 200]
plt.figure(figsize=(20, 10))
print(
"Visualizing the lower dimensional representation of data for different values of perplexity"
)
for i in range(len(perplexity)):
tsne = TSNE(n_components=2, perplexity=perplexity[i], n_jobs=-2, random_state=1)
# n_jobs specifies the number of parallel jobs to run
# -2 means using all processors except one
X_red = tsne.fit_transform(Kmean_clustering)
red_data_df = pd.DataFrame(data=X_red, columns=["Component 1", "Component 2"])
plt.subplot(2, int(len(perplexity) / 2), i + 1)
plt.title("perplexity=" + str(perplexity[i]))
sns.scatterplot(data=red_data_df, x="Component 1", y="Component 2", hue=Kmean_clustering["Labels"]),
plt.tight_layout(pad=2)
Visualizing the lower dimensional representation of data for different values of perplexity
plt.figure(figsize=(15,10))
for i,var in enumerate (num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(Kmean_clustering,x="Labels",y=var)
Scaling_data_df
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 | 0.617320 | 0.964077 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 | 0.589552 | -0.043671 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 | 0.589552 | 0.289885 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 | -0.231655 | -0.192520 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 | -0.231655 | 0.139016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | -0.486181 | 0.901646 | 0.540121 | -0.255142 | 4.308162 | -0.559673 | -1.487784 | -1.127481 | 0.429111 | -0.082116 | 0.572194 | -0.231655 | -0.185312 |
| 336 | -0.289510 | -1.065766 | -0.079703 | 1.062211 | -0.476513 | 0.053235 | -0.051186 | 0.029391 | -0.167741 | -0.337154 | -0.152012 | 0.583819 | 0.653794 |
| 337 | 0.221913 | 0.439539 | -0.206067 | -0.400362 | 0.332009 | 0.164889 | -0.342467 | -0.303532 | -0.460058 | 2.233634 | -1.589390 | 0.589552 | 0.289885 |
| 338 | -0.547053 | -0.436811 | -0.097813 | -0.369243 | 0.320933 | -0.051022 | -0.301171 | -0.239684 | -0.377852 | -0.222714 | 0.118680 | 0.511528 | 0.007032 |
| 339 | -0.336453 | 1.051046 | 0.142671 | -0.078803 | -0.055639 | 0.111378 | -0.293666 | -0.318734 | -0.092942 | 0.854902 | 0.246754 | 0.589552 | -0.043671 |
340 rows × 13 columns
# Z= linkage(Scaling_data_df,method="single",metric="euclidean")
# c,cophera= cophenet(Z,pdist(Scaling_data_df))
# c
distance_metrix=["euclidean","cityblock","chebyshev","mahalanobis"]
link_method=["single", "complete", "average", "weighted"]
highest_c=0
highest_dm_lm=[0,0]
for dm in distance_metrix:
for lm in link_method:
Z= linkage(Scaling_data_df,method=lm,metric=dm)
c,cophe=cophenet(Z,pdist(Scaling_data_df))
print("The {} link method, The {} Distance, The c value {}".format(lm.capitalize(),dm.capitalize(),c))
if highest_c< c:
highest_c=c
highest_dm_lm[0]=dm
highest_dm_lm[1]=lm
The Single link method, The Euclidean Distance, The c value 0.9286351974671321 The Complete link method, The Euclidean Distance, The c value 0.8997501863427066 The Average link method, The Euclidean Distance, The c value 0.9496161388743629 The Weighted link method, The Euclidean Distance, The c value 0.8998150165465131 The Single link method, The Cityblock Distance, The c value 0.9357621778480737 The Complete link method, The Cityblock Distance, The c value 0.644542272078475 The Average link method, The Cityblock Distance, The c value 0.9192022757302826 The Weighted link method, The Cityblock Distance, The c value 0.8647766042687631 The Single link method, The Chebyshev Distance, The c value 0.9079446457384741 The Complete link method, The Chebyshev Distance, The c value 0.8277914505636763 The Average link method, The Chebyshev Distance, The c value 0.9228451819279582 The Weighted link method, The Chebyshev Distance, The c value 0.8255276198299494 The Single link method, The Mahalanobis Distance, The c value 0.9333119306372156 The Complete link method, The Mahalanobis Distance, The c value 0.7303798022245401 The Average link method, The Mahalanobis Distance, The c value 0.9271533920266248 The Weighted link method, The Mahalanobis Distance, The c value 0.8845526840318471
link_method=["single", "complete", "average", "centroid", "ward", "weighted"]
highest_c=0
highest_dm_lm=[0,0]
for lm in link_method:
Z= linkage(Scaling_data_df,method=lm,metric="euclidean")
c,cophe=cophenet(Z,pdist(Scaling_data_df))
print("The {} link method, The euclidean Distance, The c value {}".format(lm.capitalize(),c))
if highest_c< c:
highest_c=c
highest_dm_lm[0]="euclidean"
highest_dm_lm[1]=lm
The Single link method, The euclidean Distance, The c value 0.9286351974671321 The Complete link method, The euclidean Distance, The c value 0.8997501863427066 The Average link method, The euclidean Distance, The c value 0.9496161388743629 The Centroid link method, The euclidean Distance, The c value 0.9556025941199036 The Ward link method, The euclidean Distance, The c value 0.7497369342417632 The Weighted link method, The euclidean Distance, The c value 0.8998150165465131
distance_metrix=["euclidean"]
link_method=["single", "complete", "average", "centroid", "ward", "weighted"]
fig,axs=plt.subplots(len(link_method),1,figsize=(30,40))
i=0
for dm in distance_metrix:
for lm in link_method:
Z= linkage(Scaling_data_df,method=lm,metric=dm)
axs[i].set_title("Distance metric: {}\nLinkage: {}".format(dm.capitalize(), lm))
c,cophe=cophenet(Z,pdist(Scaling_data_df))
dendrogram(Z,ax=axs[i])
axs[i].annotate(
f"Cophenetic\nCorrelation\n{c:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
i+=1
model =AgglomerativeClustering(n_clusters=3,linkage="ward",metric="euclidean")
model.fit(Scaling_data_df)
AgglomerativeClustering(metric='euclidean', n_clusters=3)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(metric='euclidean', n_clusters=3)
Agglomerative_final=Target_data.copy()
Agglomerative_final["Labels"]=model.labels_
Agglomerative_final
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | Labels | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 4.397028 | 3.966685 | 0 |
| 1 | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 4.330906 | 2.569945 | 0 |
| 2 | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 4.330906 | 3.032254 | 0 |
| 3 | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2.375414 | 2.363639 | 0 |
| 4 | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 2.375414 | 2.823150 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 9.394573e+08 | 28.976191 | 6.261775 | 2.375414 | 2.373630 | 0 |
| 336 | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 | 4.317254 | 3.536632 | 0 |
| 337 | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 1.884615e+08 | 131.525636 | -23.884449 | 4.330906 | 3.032254 | 0 |
| 338 | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 2.578925e+08 | 22.749999 | -0.063096 | 4.145112 | 2.640218 | 0 |
| 339 | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 4.985294e+08 | 70.470585 | 1.723068 | 4.330906 | 2.569945 | 0 |
340 rows × 14 columns
Agglomerative_final_result=Agglomerative_final.groupby(["Labels"]).mean().style.highlight_max()
Agglomerative_final_result
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | |||||||||||||
| 0 | 84.174131 | 5.252062 | 1.457107 | 36.270701 | 71.627389 | 184131251.592357 | 1435572471.337580 | 3.438519 | 470095931.074936 | 30.234034 | -1.673635 | 3.297753 | 2.918813 |
| 1 | 37.792353 | -18.180045 | 3.034472 | 108.764706 | 46.588235 | -680495411.764706 | -4489117117.647058 | -9.797059 | 440874880.167059 | 85.554147 | -0.087977 | -3.876786 | -2.723530 |
| 2 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3040666666.666667 | 14848444444.444445 | 3.435556 | 4564959946.222222 | 15.596051 | -6.354193 | 2.844955 | 2.683710 |
Agglomerative_final["Labels"].value_counts()
Labels 0 314 1 17 2 9 Name: count, dtype: int64
perplexity = [2, 5, 10, 20, 30, 50, 100, 200]
plt.figure(figsize=(20, 10))
print(
"Visualizing the lower dimensional representation of data for different values of perplexity"
)
for i in range(len(perplexity)):
tsne = TSNE(n_components=2, perplexity=perplexity[i], n_jobs=-2, random_state=1)
# n_jobs specifies the number of parallel jobs to run
# -2 means using all processors except one
X_red = tsne.fit_transform(Agglomerative_final)
red_data_df = pd.DataFrame(data=X_red, columns=["Component 1", "Component 2"])
plt.subplot(2, int(len(perplexity) / 2), i + 1)
plt.title("perplexity=" + str(perplexity[i]))
sns.scatterplot(data=red_data_df, x="Component 1", y="Component 2", hue=Agglomerative_final["Labels"]),
plt.tight_layout(pad=2)
Visualizing the lower dimensional representation of data for different values of perplexity
plt.figure(figsize=(15,10))
for i,var in enumerate (num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(Agglomerative_final,x="Labels",y=var)
You compare several things, like:
The Agglomerative Technique took less time to get number of cluster, Where as the in K-Mean the determining the k values is slightly challenge, But the k-means has its's advantage to performance in the any data set.
Which clustering technique gave you more distinct clusters, or are they the same?
The distance for the k-mean use the silhouette score 0.5414366069113672 and use the euclidean as the distance where as the Hierarchical use the different distance matrix The final result that we got is euclidean distance and the ward as the linkage 0.7497369342417632
How many observations are there in the similar clusters of both algorithms?
For the k-mean cluster we got 2 cluster whereas for the Hierachical cluster with ward linkage we got 3 cluster but with single linkage we got 2 cluster
How many clusters are obtained as the appropriate number of clusters from both algorithms?
k_cluster_result
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | |||||||||||||
| 0 | 84.040275 | 5.581787 | 1.425678 | 33.501577 | 71.501577 | 98893810.725552 | 1942408798.107255 | 3.791814 | 582012730.006404 | 28.859834 | -1.936359 | 3.428773 | 2.949350 |
| 1 | 37.062174 | -16.645247 | 2.908353 | 123.608696 | 49.652174 | -542023782.608696 | -4680557565.217391 | -11.214783 | 508330409.753913 | 84.334966 | 1.287871 | -3.988138 | -1.764493 |
Agglomerative_final_result
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | GICS Sector | GICS Sub Industry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | |||||||||||||
| 0 | 84.174131 | 5.252062 | 1.457107 | 36.270701 | 71.627389 | 184131251.592357 | 1435572471.337580 | 3.438519 | 470095931.074936 | 30.234034 | -1.673635 | 3.297753 | 2.918813 |
| 1 | 37.792353 | -18.180045 | 3.034472 | 108.764706 | 46.588235 | -680495411.764706 | -4489117117.647058 | -9.797059 | 440874880.167059 | 85.554147 | -0.087977 | -3.876786 | -2.723530 |
| 2 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3040666666.666667 | 14848444444.444445 | 3.435556 | 4564959946.222222 | 15.596051 | -6.354193 | 2.844955 | 2.683710 |
Cluster0
pca =PCA(n_components=13,random_state=1)
model=pca.fit_transform(Scaling_data_df)
checking_pca = pd.DataFrame(model,columns=Scaling_data_df.columns)
plt.step(np.arange(0,13),np.cumsum(pca.explained_variance_ratio_))
[<matplotlib.lines.Line2D at 0x39c507c90>]
pca =PCA(n_components=2,random_state=1)
model=pca.fit_transform(Scaling_data_df)
checking_pca = pd.DataFrame(model,columns=["comp_1","comp_2"])
plt.step(np.arange(0,2),np.cumsum(pca.explained_variance_ratio_))
[<matplotlib.lines.Line2D at 0x3987e94d0>]
np.cumsum(pca.explained_variance_ratio_)
array([0.27096417, 0.39341337])
sns.scatterplot(
data=checking_pca,
x="comp_1",
y="comp_2",
hue=Agglomerative_final["Labels"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x3987ec2d0>
perplexity = [2, 5, 10, 20, 30, 50, 100, 200]
plt.figure(figsize=(20, 10))
print(
"Visualizing the lower dimensional representation of data for different values of perplexity"
)
for i in range(len(perplexity)):
tsne = TSNE(n_components=2, perplexity=perplexity[i], n_jobs=-2, random_state=1)
# n_jobs specifies the number of parallel jobs to run
# -2 means using all processors except one
X_red = tsne.fit_transform(checking_pca)
red_data_df = pd.DataFrame(data=X_red, columns=["Component 1", "Component 2"])
plt.subplot(2, int(len(perplexity) / 2), i + 1)
plt.title("perplexity=" + str(perplexity[i]))
sns.scatterplot(data=red_data_df, x="Component 1", y="Component 2", hue=Agglomerative_final["Labels"]),
plt.tight_layout(pad=2)
Visualizing the lower dimensional representation of data for different values of perplexity
Observation
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV,cross_val_score,train_test_split,StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
X=Kmean_clustering.drop("Labels",axis=1)
Y=Kmean_clustering["Labels"]
x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=1,stratify=Y,test_size=0.30)
stand_scale=StandardScaler()
x_train_scaling=stand_scale.fit_transform(x_train)
x_test_scaling=stand_scale.transform(x_test)
model =[]
model.append(("Dtree",DecisionTreeClassifier()))
model.append(("Rforest",RandomForestClassifier(random_state=1)))
model.append(("Bagging",BaggingClassifier(random_state=1)))
model.append(("Adaboost",AdaBoostClassifier(random_state=1)))
model.append(("Gradient",GradientBoostingClassifier(random_state=1)))
model.append(("Xgboost",XGBClassifier(random_state=1)))
name=[]
result=[]
score=[]
# passing the model in the grid search
kfold=StratifiedKFold(n_splits=5,random_state=1,shuffle=True)
ssc_score="recall"
for names,models in model:
cross_value=cross_val_score(models,x_train_scaling,y_train,cv=kfold,scoring=ssc_score)
result.append(cross_value)
name.append(names)
print(names,cross_value.mean())
for names,models in model:
models.fit(x_train_scaling,y_train)
scores=metrics.recall_score(y_test,models.predict(x_test_scaling))
score.append(scores)
print(names,scores)
Dtree 0.9 Rforest 0.95 Bagging 0.95 Adaboost 0.95 Gradient 0.8833333333333334 Xgboost 0.8833333333333332 Dtree 0.7142857142857143 Rforest 0.8571428571428571 Bagging 0.7142857142857143 Adaboost 0.5714285714285714 Gradient 0.7142857142857143 Xgboost 0.7142857142857143
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(result);
ax.set_xticklabels(name);
sum(y_train==1),sum(y_train==0)
(16, 222)
os=SMOTE(k_neighbors=5,random_state=1,sampling_strategy=True)
x_train_over,y_train_over=os.fit_resample(x_train_scaling,y_train)
sum(y_train_over==1),sum(y_train_over==0)
(222, 222)
result_1=[]
score_1=[]
for names,models in model:
cross_scale=cross_val_score(models,x_train_over,y_train_over,cv=kfold,scoring=ssc_score)
print(names,cross_scale.mean())
result_1.append(cross_over)
for names,models in model:
models.fit(x_train_over,y_train_over)
score_val=metrics.recall_score(y_test,models.predict(x_test_scaling))
print(names,score_val)
score_1.append(score_val)
Dtree 0.9910101010101011 Rforest 0.9911111111111112 Bagging 0.9911111111111112 Adaboost 1.0 Gradient 0.9954545454545455 Xgboost 0.9911111111111112 Dtree 0.7142857142857143 Rforest 0.8571428571428571 Bagging 0.7142857142857143 Adaboost 0.5714285714285714 Gradient 0.7142857142857143 Xgboost 0.7142857142857143
plt.boxplot(result_1);
estimator = GradientBoostingClassifier(random_state=1)
parameter={
"loss":["log_loss"],
"learning_rate":[0.1,0.01,0.001],
"n_estimators":np.arange(100,200,20),
"min_samples_split":[2,3,4],
"min_weight_fraction_leaf":[0.0,0.001],
"max_depth":np.arange(3,10,1),
"n_estimators":np.arange(80,150,10)
}
kfold=StratifiedKFold(n_splits=5,random_state=1,shuffle=True)
Grid_cv= GridSearchCV(estimator,parameter,cv=kfold,scoring=ssc_score)
Grid_cv.fit(x_train_over,y_train_over)
best_fit=Grid_cv.best_estimator_
best_fit.fit(x_test_scaling,y_test)
GradientBoostingClassifier(learning_rate=0.01, max_depth=7, n_estimators=80,
random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. GradientBoostingClassifier(learning_rate=0.01, max_depth=7, n_estimators=80,
random_state=1)def checking_performace(model,predictor,target):
prec=model.predict(predictor)
recall=metrics.recall_score(target,prec)
precision=metrics.precision_score(target,prec)
f1_score=metrics.f1_score(target,prec)
accuracy=metrics.accuracy_score(target,prec)
result =pd.DataFrame({
"accuracy":accuracy,
"recall":recall,
"precision":precision,
"f1_score":f1_score
},index=[0])
return result
checking_performace(best_fit,x_train_over,y_train_over)
| accuracy | recall | precision | f1_score | |
|---|---|---|---|---|
| 0 | 0.777027 | 0.558559 | 0.992 | 0.714697 |
checking_performace(best_fit,x_test_scaling,y_test)
| accuracy | recall | precision | f1_score | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
def confution_matrix(model,prediction,target):
prec=model.predict(prediction)
cm=metrics.confusion_matrix(target,prec)
label=np.asarray([
["{0:0.2f}".format(item)+"\n{0:0.2%}".format(item/cm.flatten().sum())]
for item in cm.flatten()
]).reshape(2,2)
sns.heatmap(cm,annot=label,fmt="")
confution_matrix(best_fit,x_train_over,y_train_over)
confution_matrix(best_fit,x_test_scaling,y_test)